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Abstract 

Various molecular interaction networks have been claimed to follow power-law decay for 
their global connectivity distribution. It has been proposed that there may be underlying 
generative models that explain this heavy-tailed behavior by self-reinforcement processes 
such as classical or hierarchical scale-free network models. Here we analyze a comprehensive 
data set of protein-protein and transcriptional regulatory interaction networks in yeast, an 
E. coli metabolic network, and gene activity profiles for different metabolic states in both 
organisms. We show that in all cases the networks have a heavy-tailed distribution, but 
most of them present significant differences from a power-law model according to a stringent 
statistical test. Those few data sets that have a statistically significant fit with a power-law 
model follow other distributions equally well. Thus, while our analysis supports that both 
global connectivity interaction networks and activity distributions are heavy-tailed, they are 
not generally described by any specific distribution model, leaving space for further inferences 
on generative models. 
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1 Introduction 



Over the last few years it has been extensively argued that the connectivity distributions of 
various molecular i nteraction networks follow power-law distri butions that are bes t approxi- 
mated by classical ( iBarabasi and Albertl. Il999l) or hierarchical (jRavasz et all |2002| ) scale- free 
network models (reviewed in Refs. flAlbertl . 120051 : iBarabasil . 12004^ . Although this does not 
imply that power-law distributions do in fact describe the observed degree distributions, it 
has been used as motivation for developing generative models that yield power laws. More- 
over, power-law models also appear to charact erize the distr i butio n of activity levels. For 
example, both c alculated metabol ic flux values (lAlmaas et all 120041 ). and measured gene ex- 



pression values (jUeda et all . 120041 ) seem to follow such a distribution, which is in agreement 



with theoretical predictions for load distribution on scale-free networks ( lAlmaas et al.1 . 12004 



Goh et all . 1200 lh . 



Using a stat i stical framework similar to the one employed in this paper, a recent work 
( Edwards et al.1 . 120071 ) revisited several enhanced datasets of foraging patterns in wild an- 
imals and concluded that the distributions previously classified as power-law were better 
described by a Gamma distribution. The previous spurious results were attributed in part 
to the limited magnitude of the dataset (also typical of biological datasets) and to the 
graphical method used t o assess the charact e r of the distribution s. It has been suggested 



( Khanin and Wit . 2006 : Stumpf et al. . 20071 : Tanaka et al. . 2005 ) that the charactheristic 



observation that biological networks follow a power-law distribution may have been reached 
due to the same methodological shortcomings. Specifically, it has been argued that analysing 
relatively small cellular networks, having only a few hundred to a few thousands data el- 
ements using the commonly employed frequency-degree or intensity plots, does not have 
sufficient power to differentiate among various network models having heavy-tailed distri- 
butions, and that the use of rank-deg r ee (intensity) plots proves superior for this purpose 



( iKhanin and Witl . l2006t IStumpf et al.l . l2007t iTanaka et al.l . 120051 ). Furthermore, a rigorous 



quantification of goodness-of-fit is also required to establish relatedness to various network 
models, and the quality of the underlying data set (i.e., the quality of network reconstruction) 
is critical for proper analysis. 



1.1 The Models and the MLE Analytical Framework 

Assessment of the best model explaining a given data distribution has typically been done 
using simple linear regression methods. These methods are suitable for normal distribution 
functions, but not for highly skewed distribution functions. Essentially, the problem arises 
from the fact that skewed distributions are characterized by the scale of the tail, which forms 
most of the support for the distribution but that is barely populated (i.e. contains less than 
10% of the data points). Because of this, simple least square fits of the probability density 
distribution com puted via histogram methods are very poor estimators of the distribution 



parameters (see (ITanaka et al.l . 120051 ) for a review of possible problems) due to the noisy 



poor sampling of the tails. 
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Therefore, it has been argued that density plots should not be used as a base for the fitting 
of these types of data, as several more reliable methods are available. A sim ple and better 
strat egy is to use rank-plots as commonly used i n engineering and economics (ITanaka et al. 



20051 ). Logarithmic binning (lAlbert et al.l . Il999h has also been used as a more robust alter- 



native, but it has been reported that this procedure fail s to retrieve the value of the exponent 



as the slope of the graph for power-law distributions (ITanaka et al.l . 120051 ; iGoldstein et al. 



2004 ; IClauset et al.l . 120091 ) . Finally, one could appl y a logarithmic transfor mation to the 



data and fit the corresponding distribution function (IDe Fabritiis et al.l . 120031 ) thus avoiding 
the problem of skewed data from the outset. In this case the appropriate transformation 
of the probability distribution has to be performed (for instance a normal distribution for 
log-transformed, log- normally distributed data), a procedure which could be cumbersome 
for some distribution functions. 

Below we focus on the discrimination of different models (see Section I2~2"j) for several types 
of molecular interaction and expression data sets by using probability-based techniques to 
perform comparative analyses. In order to have a good mathematical representation of 
the probability distri bution we use cumu lative distribution functions (cdf) that are directly 
related to rank-plots (ITanaka et al.l . 120051 ) . In addition, instead of graphical-based estimation 
methods we utilize maximum likelihood estimation (MLE) (see Section [2]), which is not 
dependent on the gr aphical representation and allows us to perform a statistical test of 
the proposed model s ( Goldstein et al. . 2004 : Stumpf and Ingram . 2005 : Hoogenboom et al. 



2006; Clauset et al. 



2009h . 



Based on these considerations, here we reexamine both the global topological connec- 
tivity distributions and absolute gene and protein expression value distributions in cellular 
networks, with a focus on the most completely reconstructed, highest quality data sets. We 
have tested the empirical distributions against several plausible models with heavy-tailed dis- 
tribution. Note also tha t generative models in the context of biological networks have been 



proposed for power-law dBarabasi and Albert 



199 91; iQian et al.l . l200ll ; lRzhetskv and Gomez . 



20011 ; iBhan et all . 120021 ; iPastor-Satorras et all . 120031 ) and broad-scale ( lAmaral et al.l . 120001 ) 
distributions but not for the rest of the tested distributions. We examine the extent to 
which several types of distributions, some with plausible underlying generative models, can 
provide a similar probabilistic framework for the experimental data being analyzed. We 
show that for the analyzed molecular interaction distributions, only the high-throughput 
protein-protein interaction data set and the out-degree distribution of the transcriptional 
regulatory network significantly fitted the power-law model. In the case of the distributions 
of global gene and protein expression values, only two mid-log cultures of E. coli showed 
a statistically significant fit. At any rate, all experimental data sets having significant fits 
for the power-law distribution, showed equally good fits for other distributions which leads 
to inconclusive results for the support of a single specific distribution. We discuss the im- 
plications of this finding for the uniqueness of generative models solely from topology and 
activity data distributions. 
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2 Materials and Methods 



For model clas s ificati on, we have basically used the methods developed and implemented in 



(IClauset et all 120091 ). However, we provide here sufficient details about the mathematical 



foundations to make the analytical procedure clear. 



2.1 Probability Based Estimation Method 



The likelihood of a given probability distribution p„ 
describing a given dataset of independent data (xi, . 

N 



{x^ depending on parameters v and 
.,xn), is defined as l(v\xi, ...,xn) = 



Our datasets are built of the (discrete) number of edges in interaction networks 



or the (continuous) amount of expression of the nodes in activity networks. Therefore, from 
a group of edges or nodes (xi, ...,xn), we obtain the probability associated with each value 
(p v (xi), ...,p v (x N )). 

Here we use different model distributions with their corresponding set of parameters v. 
In each case a maximum likelihood estimation (MLE) based on the log-likelihood function, 
L(v|x) = log l(v\x), is used to compute the set of parameters v opt which maximizes L. 
A special case represents the parameter x m i n which is the lower bound of the power-law 
distribution (see Section SI 2) . For its calculation we followed the alternative procedures 
developed and implemented at (IClauset et al.l . 120091 ) . Once the vector v opt has been obtained, 



we move on to find if there are differences between the empirical data and the model. An 
intuitive approach would be to perform a Kolmogorov- Smirnov (KS) test usin g the empirical 



distribution set and the estimated model distribution (IGoldstein et all l2004l ). However the 



two distributions are not independent, which is one of the required premises of the KS test, 
because the model was estimated from the same data with which we want to perform the test. 
Ins tead, a Monte Carlo protocol can be used to avoid such direct comparison. As described 



in (IClauset et al.l . |2009j) , the KS statistic D is calculated for many Monte Carlo generated 



data sets from the estimated distribution. Our p-value will be simply the fr action of times 



that D for the empirical data is larger than D for the generated data sets (IClauset et al 



20091 ). We define the null hypothesis H as, no statistical difference between the data and 
the model. In our case, the confidence valu e of p = 0.1 wi l l in fact be more appropriat e than 



p = 0.05 for the statistical test, following (IClauset et all 120091 ; iMavo and Cox! . 120061 ). The 



rationale is that we are looking for differences in only one tail of the distribution, we look 
for p- values that are larger than that associated with the experimental fit. 



2.2 Distribution Functions Estimation 

Both continuous and discrete model distributions are used depending on the nature of the 
data analyzed in each case. As described in Section 12.2. lj. we have tested seven models 
of probability distribution against the studied data. In turn, five model distributions were 
tested against the continuous data sets: power-law, log-normal, exponential, Weibull and 
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broad-scale. It is worth noting that only some of the distributions tested (e.g., power-law and 
broad-scale) have been pu t forward as distribut i ons agreeing with plausible generative models 



in the biological context ( Barabasi and Albert . 1999 ; Amaral et al. . 2000h . The intention of 



this work is to expand the application of the methods to other a priori suitable distributions 
to give a more general scope to the study. 



2.2.1 Distribution Functions 



Power-law model: A random variable X is said to follow a power-law distribution with 
index a when the scaling law of P[X > x] is power-law, such as P[X > x] = 1 — P[X < 
x] ~ cx~ a for x— > oo. We used the Pareto distribution for the continuous data: probability 
density function (pdf): p(x) = ^ (^p - ) , cumulative distribution function (cdf): P[X < 

-a+l 

and for the discrete case we used Zipf's law with the following terms, 
cdf: P[K < k] — y. f N . The Hurwitz zeta function is defined as 



x\ 



pdf: p(k) 



C (a, k) = ^2(n + k) a . The reader is referred to ( Goldstein et al. . 20041 ) and ( Clauset et al 



n=0 



20091 ) for further explanations. Specifically, fo r the d etermination of x 



used the methods described in (IClauset et all |2009i ). 



(or k min ), we have 



Log-normal model: A random variable X is log- normally distributed if Y = log(X) follows 
a normal distribution. We have used the following distributions for the continuous case: pdf: 

P( x ) = " „ 1 /or e ~ 2g2 ) c df: P[X < x] = \ + |erf ( ^~^^ )- Numerical approximations of 



( IClauset et all 120091 ) have been used for the discrete models. 

Poisson model: We have used the discrete Poisson distribution to model the discrete data 
sets using the following probability mass function (pmf): f(k, A) = — . 

Yule model: Again this is a discrete distribution that we have only used for discrete 
data sets. We used the following forms, pdf: p(k) = pQ{k,p+ 1) and cdf: P[K < k] = 
1 — k~B(k, p + 1), where p > and B is the Beta function. 

Exponential model: Exponential models have no biological relevance in the framework of 
our study. However, we have included it in some cases for methodological comparison as a 
representative model of a non fat-tailed distribution. For a random variable X distributed 
exponentially we have used the following forms: pdf: p(x) = Xe~ Xx , cdf: P[X < x] = 1 — e 



Ax 



e as the 



for the continuous data sets; we have taken the pmf: 
discrete model. 

Weibull model: Also referred to as stretched exponential, we have used pdf: p(x) = 
{k/X){x/X) k ~ l e~^ x / x ^ and cdf: P[X < x] — 1 — e~( X//A ) for the continuous data sets. For 



the discrete data we have used the code avail able from (IClauset et all 120091 ) that uses the 
Nakagawa-Osaki (INakagawa and Osakil . Il975l ) method for the discretization of the Weibull 
distribution, pmf: f(k;q,(3) = q kf> . 

Broad-scale model: Broad-scale distribution functions, also referred to here as power- 
law with cut-off or power-law plus exponential, identify a class of distributions that are 



characterized by a scaling law p(x) ~ ce 



- A,r 



x 



for x — > oo. Note that the broad-scale 
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is very similar to the Gamma distribution. Both contain a shape and a scale parameter. 
The Gamma distribution in addition contains a normalizing constant. For the Gamma 
distribution, the pdf would be p(x) = j^ye _/3x x Q_1 , being T(x) = (a — 1)! for a > 0. 
The broad-scale distribution is a type of nested function, which implies a specific statistical 
treatment in Section [2731 In the current case the broad-scale consists of a power-law behavior 
and after a given thr eshold it has an exponential decay. Numerical routines available from 
( IClauset et all . 120091 ) have been used to calculate the corresponding pdf and cdf. 



2.3 Distribution Comparison 

To compare the feasibility of the different models with respect to the power-law distribution, 
we applied a likelihood ratio test, which compares the fits of two given competing distri- 
butions. Note that we are not comparing a model against the empirical distribution but 
two different distributions with each other. Strictly speaking in this case we compare differ- 
ent models with each other to indirectly assess ho w well the distributio ns exp lain the data . 



We have again applied the methods developed in ( IClauset et all 120091 ) and ( IVuongl . 119891 ) 



to evaluate the normalized log-likelihood ratio (NLLR). If the NLLR has a positive value 
the power-law model is supported, whereas the alternative distribution offers a better fit if 
NLLR has a negative value. In order to determine significant positive or negative values, we 
calculated the associated p-value. Given the different experimental design with respect to 
the one d escribed in Sect i on 12.11 we take here as confidence value the most commonly used 
p = 0.05 ( iMayo and Coxl . 120061 ). We search for differences on both sides of the distribution 
and therefore we use p = 0.05 as level of significance. 

The broad-scale distribution requires a special comment, as it is a nested function contain- 
ing both power-law and exponential terms. In fact, if we compare the power-law distribution 
fit with the broad-scale distribution fit, the NLLR test will always be zero or negative, fa- 
voring the broad-scale, as it reproduces the power-law behavior plus an additional term. To 
solve this problem, we used a correct ion of the p- value ca l culat ed for the log-likelihood ratio 
(LLR) as described in (IVuongL Il989l ) and (IClauset et all |2009|). 



2.4 Data Sets Used 

Molecular interaction maps were obtained from different sources: S. cerevisiae protein 



protein i nteraction networks (IReguly et all 120061 ). S. cerevis iae transcriptiona l regulatory 



network fjMacIsaac et all . 120061 ). E. coli metabolic networks (jFeist et all 120071 ). Yeast ex- 



pression values were taken from flGhaemmaghami et all 120031 ). Global mRNA expression 



data for E . coli c ells and protein expression data for S. cerevisiae cells were obtained from 
(iBeg et allE007h . 



6 



3 Results 



3.1 Global Topological Organization of Molecular Interaction Net- 
works 

The aim of this paper is to test by means of a probabilistic approach the ability of heavy-tailed 
distribution functions to explain the experimental data distributions for given datasets. Due 
to the nature of the experimental data being considered here, we have differentiated between 
global interaction networks and global activity networks. Global interaction measurements 
refer to the number of interactions of a given molecule. Thus, measurements are counted as 
natural numbers and the studied models will be discrete. On the other hand, global activity 
interactions are defined in terms of cellular expression, measurements are expressed using 
positive real numbers and the applied models will be continuous. 



3.1.1 Global Interaction Networks 



To assess the degree distribution of an undirected homotypic molecular inte r action network 
we used data obtained from baker's yeast, S. cerevisiae, in ( jRegulv et al.l . 120061 ) that, to 
date, is the most comprehensive information for a species-specific protein-protein interaction 
(PPI) network. We analyzed two data sets built using two different methodologies. The first 
data set was created from the combination of five different stu dies based on exper imental 



high-throughput techniques, which we refer to as HTP data (jRegulv et all 120061 1 . The 



second data set was curated manu ally from the literat ure (LC data set), and is assumed to 
be more accurate than the former (jRegulv et all 120061 ) . After the removal of redundant and 
self-interactions ( jRegulv et al.l . 120061 ) . we obtained a data set of 11,571 interactions between 
4,474 proteins for the HTP data and 8,165 interactions between 2,689 proteins for the LC 
data. Table [T] shows that after fitting the parameters using an MLE, only the HTP data 
set provides a statistically sufficient goodness-of-fit, based on the KS test (see Materials and 
Methods). Whether or not the high rate of false- discovery interactions from yeast-two hybrid 
experiments in the HTP data set leads to the acceptance of th e null hypothesis for the power- 
law model is an open issue for discussion ( Huang et al. . 2007t ) . In fact, for the same network, 
if the data is retrieved manually from the literature (LC data set), the power-law model 
is rejected. Furthermore, for the case of HTP data, exponential and Poisson distributions 
behave significantly worse than the power-law model while for the LC data set, log-normal, 
Yule, Weibull and broad-scale are better models than the power-law. Fig. la shows the 
degree distribution of the LC network with its corresponding best fitted power-law model. 
An equivalent plot for the HTP data set is shown in Fig. SI la. 

We also analyzed the degree distribution of a directed heterotypic m olecular interac- 



tion network, the transcriptional regulatory (TR) network of S. cerevisiae ( jMacIsaac et al. 
20061 ). the edges of which represent interactions between transcription factors (TF) and 



gene regulatory regions, distinguishing out-degree from in-degree interactions. In total the 
data includes 99 TFs and 1,851 TF-regulated genes connected through 3,394 links. Tabled] 
clearly indicates that the degree distribution of out-degree interactions (TF— >-gene) provide 
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a statistically significant fit for the power-law model. However, all other tested distributions 
except the Poisson fit the empirical data equally well which prevents us from establishing 
unequivocally the power-law distribution as the statistical model to describe the data. This 
result should be take n with caution a s the s ample size (N = 99) is really on the limit to be 
considered too small f lHart and Clarkl . Il999l ). For the case of in-degree interactions (number 
of TF affecting a given gene) the range of connectivities is too small to provide conclusive 
results, although it appears that all other tested models apart from the Poisson model offer 
significantly better fits than the power-law. A graphical representation of the out-degree 
distribution with their best-fitted power-law model is shown in Fig. lb. Fig. SI lc shows 
the corresponding plot for the in-degree distribution. 

Fi nally, we assessed the recent high quality reconstruction of the E. coli metabolic net- 
work ( jFeist et al.l . 120071 ). in which substra tes are connected t o each other through edges that 
represent the actual metabolic reactions ( Ueong et al.l . 120001 ) with a total of 2,381 reactions, 
of which 304 are exchange, 553 reversible and 1,524 irreversible reactions. We have studied 
separately the in-degree distribution with 1,657 nodes and 3,050 links and the out-degree 
distribution with 1,656 metabolites and 2,788 links. Both networks behave very similarly: 
both display significant differences with the power-law model. While the log-normal model 
fits the data equally as well as the power-law model, all other tested models behave signif- 
icantly worse. The empirical distributions with their corresponding best-fitted power-law 
models are represented graphically in Fig. SI le,f. 



3.1.2 Clustering Coefficients 

To gain further insight into the topological organization of molecular interaction networks, 
we have analyzed the C(k) vs . k re lationships for the HTP and LC re constructed PPI 



netw orks of yeast ( iReguly et al.l . 120061 ) , the T R regulatory netwo rk of yeast ( iMacIsaac et al. 



20061 ). and the metabolic network of E. coli ( jFeist et al.l . 120071 ) . Fig. SI 2 shows how the 



dependence of the clustering coefficient in the three types of molecular interaction networks is 



not uniform. While the E. coli metabolic network dis 



plays a distribution close to C{k) = k~ 



(Fig. SI 2d), as previously described (IRavasz et al.l . 120021 ). the distribution observed in the 
other two networks is significantly less organized. 



3.2 Global Activity Level Distributions in Molecular Interaction 
Networks 

Next we assessed the distribution of activity levels achieved on the underlying network 
topology. First, we examined single time point snapshots for mRNA and protein expression 
in E. coli and in S. cerevisiae, respectively. 

We used global gene expression data from E. coli cells grown in mid-log phase batch cul- 
tures with single carbon sourc e media, using ac etate, galactose, glucose, glycerol and maltose 
as individual carbon sources (IBeg et al.l . 120071 ). Signals for 3,977 probes were examined for 
the expression level of the corresponding genes based on their hybridization intensity (Table 
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[2]). Activity distribution of acetate growing cells significantly fits the power-law model, in 
fact representing the best fit with the data in all cases analyzed in this study. Interestingly 
though the broad-scale model gives us a significantly much better fit. Maltose is another 
carbon source where gene expression distribution of E. coli shows significant fit with the 
power-law model. However, three other distributions, log-normal, Weibull and broad-scale 
provide significantly better fits than the power-law model. Of the two cases, in which the 
power-law model displays no significant differences from the empirical data but other models 
are also plausible, the classification of power-law character remains uncertain. None of the 
other mid-log cultures provide a significant fit with the power-law distribution. For glucose, 
galactose and glycerol, log-normal, Weibull and broad-scale are significantly better models 
than the power-law, while exponential is significanlty worse model. Worth noting is the 
graphical comparison of Fig. 1 c and Id. The acetate distribution displays no statistical 
difference with the power-law model while the galactose distribution does differ from it. This 
important difference in the nature of the empirical data is not revealed from the inspection 
of the graphical representations but uniquely from the statistical test. 

We also examined the transcriptome state of E. coli cells grown in chemostat cultures 
at differen t dilution rates , repre senting various steady-state growth rates at different culture 



densities ( jVazquez et al.l . 120081 ). We find that the general pattern for gene expression dis- 



tribution is largely invariant in all different growth media and at different growth rates, but 
for the steady-state cultures none of the empirical data sets fit the power-law model signif- 
icantly (Table SI 1 and Fig. SI 4). Curiously the broad-scale model outperforms for all 
steady-state cases. Also note that the log-normal and Weibull distributions display superior 
fits for dilutions 0.25 and 0.4. In all the cases, the exponential distribution fits the empirical 
data significantly worse than the power-law distribution. 

Protein expression values in mid-log phase of S. cerevisiae cell cultures from 3,868 dif- 
ferent strains expressing a s ingle GFP or TAP tagged protein in a rich growth media 



f lGhaemmaghami et al.l . 120031 ) displayed a slightly different expression mode (Fig. SI 5), 



suggesting difference in mRNA and protein expression value distributions under these growth 
conditions. Yet again the empirical dataset shows significant differences with respect to the 
power-law model (Table SI 2). Comparing models, the power-law distribution fits signifi- 
cantly better than the exponential, although similarly to the log-normal and Weibull distri- 
butions. However the broad-scale model offers a significantly better fit that the power-law 
model. 

Finally, to assess the transcriptome state of E. coli cells under the most complex growth 
conditions, we have examined the dynamic al microarray pr ofile of E. coli cells grown in 



batch culture in a mixed-substrate medium (IBeg et al.l . 120071 ) . At all sampled time points, 



the distribution of expression of the transcriptome was differing significantly from the power- 
law distribution (Table SI 3). In this mRNA expression data set the log- normal, Weibull 
and broad-scale models show better statistical fit with the experimental data at all time 
points, while the exponential distribution is also better for some of the cases. A graphical 
representation of the best fitting power-law distribution against the empirical data is shown 
in Fig. SI 6. 
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4 Discussion 



The structure and activity of intracellular molecular interaction networks represents a basic 
tenet of life. Thus it is of great importance to correctly identify their true structural and 
functional properties at the global, intermediate and local levels. Heavy-tailed probability 
distributions have been used in the literature to explain the topological features of complex 
biological networks. In this work we have analyzed to what extent, among others, three well 
known heavy-tailed distributions (power-law, broad-scale and log-normal) can be uniquely 
used to represent the observed experimental data for protein interaction, transcription reg- 
ulation, metabolic pathways and expression experiments. Our results demonstrate that the 
large-scale topology of the molecular interaction networks and the global mRNA and pro- 
tein expression distributions examined here do not strictly follow power-law distributions. 
Moreover, none of the three heavy-tailed models tested had a universal agreement with the 
empirical data even when using the highest quality data sets available. Distributions are 
evidently heavy-tailed and for this type of data MLE analyses prove superior to graphical 
methods for assessing different tested distributions. However, we could not assess any of 
the studied models with statistical reliability. Our results could be explained by several 
non-exclusive arguments. First, our analyses could be affected by a biased acquisition of 
the empirical data, a systematic error that would affect, for instance, proteins which have 
been studied more extensively because they have a high biomedical interest. Also, such sim- 
ple mathematical frameworks as considered here may not represent entirely the biological 
mec hanisms at work, but also be the convolution of the physicochemical constraints of the 
cell (IBeg et al.l . 120071 ). Finally, population- level evolutionary processes, such as activation 
of a foraging prog ram upon extrace llular substrate exhaustion, clearly affect the function of 
cellular networks ( IBeg et all 120071 ) and are likely to influence the nature of the underlying 
molecular interactions as well. 

Additional problems can have their origin in data acquisition. In some cases, the ex- 
perimental data might be incomplete or noisy, specially for the HTP data which derives 
partially from yeast two-hy brid experiments. There has been several efforts (IHuang et al. 



20071 ; IScholtens et all 120081 ) to infer statistically the actual PPI network from such experi- 



ments and consequently our results of the analysis refer to the data, not the actual network. 
Indeed, the analysis of the reconstructed network from literature, in a manually curated way, 
considered to be more correct, yields different results. The inference of the actual network 
is out of the scope of the current work. 

We should also note that the topology of cellular networks can be viewed at various 
levels of complexity. For example, in the metabolic representation consider ed here each 



meta bolite that participates in an interconversion reaction is considered equal (IJeong et al. 



20001 ) . However, in alternative representations molecules that only act as donor or acceptor 
(e.g., ATP or ADP) or th at do not con tribute a carbon or nitrogen atom to a metabolic 
reaction can be excluded f Arital. 120041) . or they can be pre- classified according to their 
perceived biochemical role ( Tanaka . 12005 ). Similarly, in TR networks, genes and their protein 
products are usually defined as common nodes with regulatory links among them being 
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mediated by the binding of TFs to the promoter regions of the genes (jShen-Orr et all 120021 ) . 
At other times, however, g enes and their protein products are considered as separate nodes 



fjYeger-Lotem et all 120041 ) . While certain prop e rties are unaff ected by alternative network 



representations, others are affected (lArital . 120041 ; iTanakal . 120051 ) potentially complicating the 
interpretation of subsequent analytical results. 

Finally, our analyses also reveal highly similar, but dynamically regulated global mRNA 
and protein expression profiles, in which expression values are significantly variable. Thus, 
while the global funct ion of cellular networks is expec t ed to be greatly i nfluen ced by their 



underlying topology ( lAlmaas et all 12004 lUeda et all 12004 ; iGoh et all 1200 ll ). mRNA or 



protein expression data reflect the dynamic physicochemical state of the cell, likely neces- 
sitating an explicit dynamical model for establishing a relationship between topology and 
expression. 
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a. 



b. 




Figure 1: The connectivity degree distribution of (a.) LC derived S. cerevisiae protein-protein interaction; 
(b.) out-degree distribution of the transcriptional regulatory network of S. cerevisiae and (c, d.) intensity 
distribution of mRNA expression values in E. coli cells from mid-log (OD 0.2) cultures with (c.) acetate 
and (d.) galactose as single carbon source. Points represent the empirical distribution, lines represent the 
most likely power-law models. 
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Table 1: Summary of probabilistic analyses for representative examples of the discrete data sets. The Supplementary Information includes 
the further results discussed in the text. The second column shows the p- value associated with the differences between the data and the 
power-law model. The next six columns correspond to the log-likelihood ratio tests comparing the power-law model with other plausible 
models. Positive values support the power-law model. The normalized log-likelihood ratio (NLLR) is used for non-nested functions while 
the raw log-likelihood ratio (LLR) is used for the power-law with exponential cutoff model, p-values here are associated with the differences 
between the two models. Significant p-values are denoted in bold. 
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Table 2: Summary of probabilistic analyses for representative examples of the continuous data sets. The 
Supplementary Information includes the other results discussed in the text. Interpretation of the table as 
for Table (TJ 
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1 Clustering Coefficient Analysis 



The clustering coefficient is denned as C(i) 



2e, 



i denoting a given node, ki the number of 



links of i, and e, the number of connections among the links. The clustering coefficient ranges from 
(low connected network) to 1 (highly connected network) . In relation to the connectivity distribu- 
tion, k, scale-free an d modular networks have been reported to display constant C(k ) distributions 



(Ra yasz et all |2002| ). while hierarchical networks display C(k) = k distributions (jRavasz et al 



2002). In our analyses, we do not find a conclusive agreement between any of the models and the 



experimental data (Figure [2]). 



2 Analysis Protocol 

In order to give a broad idea of the methods we used, w e explain here t he m athematical protocol 



we followed, which is the same as the one described in IClauset et al.l (|200S ). The proto c ol wa s 
newly developed by them and we used their own implementation in MATLAB dClauset et all . l200flh . 
We have applied it to the biological data sets of interest. 

The main idea of the protocol is to answer two very well defined questions: 

• Given the power-law model, which is the most probable parameter combination for a given 
data set? 

• Once we have an inferred model, is there any significative difference between the model and 
the data set? 

For the first question we use maximum likelihood methods to find the most probable parameters 
from a given model in order to explain the data set. In the case of the power-law model, we have 
two parameters, a, which is the scaling parameter and x m i n which is the lowest x from which the 
distribution behaves as power-law. The parameter x m i n is inferred in a special way as it represents 
one of the bounds of the distribution but d oes not affect the shape of it. The method used, described 
and implemented by Clauset et al. (2009), consists in a very practical approach: x m i n is chosen so 



the difference between the inferred model and the data set is minimized. The rationale behind it 
is that if x m i n is underestimated, the scaling parameter will be wrongly inferred while if x-min is 
overestimated the reduced data size will provoke random fluctuations on the estimation. 

Once we have in hands a data set and its corresponding inferred model, we want to know if there 
are differences between the data and the model in order to know whether or not the data set follows 
the power-law distribution. In this context of non-normality an appropriate and common test used 
to compare a reference distribution and an empirical distribution is the Kolmogorov-Smirnov (KS) 
test. However one of the premises of the KS is that the two distributions need to be independent. 
In our case the model is inferred from the same data set we want to compare it with and the premise 
of independency is not hold as correlations are introduced. Therefore Monte Carlo methods are 
used to generate synthetic data from the inferred model. A D statistic of the KS test is calculated 
for each synthetic data set. At this point we have a distribution of D values. The position of the 
D statistic calculated from the empirical data set and the inferred model, within the distribution 
of D values calculated from the Monte Carlo generated data sets, gives us the p- value associated to 
the empirical D. Specifically the way of calculating the p-value is simply counting the number of 
times the empirical D is larger than the synthetic D. As we are looking for deviations in one side 



2 



(larger than) the confidence value that should be used is p = 0.1. We have elaborated a simplistic 
graphical flow of the protocol to facilitate the comprehension of the methods used (Figure [5]). 
Finally, we have an additional question: 

• Is there any other model that fits better the data than the power-law distribution? 

Once we know to which extent is the power-law model able to explain the empirical data distribu- 
tion, we wonder if there is any other model that could equally well describe the experimental data. 
For that purpose, the normalized-log-likelihood ratio test is used. It is a simple test that compare 
the log-likelihood values of two competing distributions with respect to the empirical data. For the 
case of nested functions, which is the case of the broad-scale, the p-value should be corrected. 
All code and data sets necessary to reproduce the results are provided upon request as a tar 

file. 
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Figure 1: a., b. The degree distribution of connectivity of S. cerevisiae protein-protein interaction network; c, 
d. transcriptional regulatory network and e., f. the E. coli metabolic network. Points represent the empirical 
distribution, lines represent the most likely models. 
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Figure 2: a., b. Clustering coefficients (C(k)) for the degree distribution of nodes in the S. cerevisiae protein-protein 
interaction network built from HTP and LC data respectively; c. S. cerevisiae transcriptional regulatory network, 
and d. the E. coli metabolic network. Circles represent empirical distributions, while lines represent C(k) — fc _1 . 
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Figure 3: Intensity distribution of mRNA expression values in E. coli cells from mid-log (OD 0.2) cultures with 
different molecules as single carbon source: maltose, glycerol, glucose, galactose and acetate. Lines represent the best 
fitted power-law model. 
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Figure 4: Intensity distribution of mRNA expression values in E. coli cells from steady-state cultures with 0.2% 
glucose-restricted M9 minimal medium at different dilution rates: 0.1, 0.25, 0.4, 0.55 and 0.72. Lines represent the 
best fitted power-law model. 
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Figure 5: Protein expression values distribution in S. cerevisiae clones expressing individual GFP or TAP tagged 
proteins. Line represents the best fitted power-law model. 
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Figure 6: Activity distributions in a dynamical transcriptome. m RNA expression values distribution in E. colt cells in a time-series growth experiment 
(between 2-8 hrs.) on mixed-carbon substrate medium are shown (|Beg et all l2007h . Lines represent the best fitting models for power-law models. 
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Figure 7: Distribution of the D values for the 2,500 synthetic data sets generated using Monte Carlo. Due to 
computational costs we show here the graphs for 2,500 data sets, however, 250,000 have been used to calculate the 
p- values associated to the power- law estimation. Red dashed line indicates the confidence level at 0.1 and black 
dotted line indicates the experimental value. 
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Is my data power- law? 
1st 2nd 




Figure 8: Several steps are used to know if a data set follows a power-law distribution. We infer first the power-law 
parameters, a and x m in using MLE. We then use the KS test to find Do. Do value is retrieved from two correlated 
distributions, the experimental distribution and the model distribution inferred from the experimental data set. 
Therefore the p-value associated to it is not valid. We generate several (250,000 in this article) synthetic data sets 
from the inferred model using Monte Carlo. We repeat the previous step for each data set generated, MLE and KS. 
At this point we have a distribution of D synthetic values. The position of Do within the distribution of synthetic 
values provides the adequate p-value. The distribution of synthetic D distribution and Do values for the discrete 
interaction data sets are shown in Figure [7] 
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Table 1: Second column shows the p- value associated to the differences between the data and the power-law model. Next four columns correspond 
to the log likelihood ratio tests comparing the power-law model with other plausible models. The normalized log likelihood ratio (NLLR) is used for 
non-nested functions while the raw log likelihood ratio (LLR) is used for the power-law with exponential cutoff model, p-values here are associated to 
the differences between the two models. Significant p-values are denoted in bold. 
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Table 2: Second column shows the p- value associated the differences between the data and the power-law model. Next four columns correspond to the 
log likelihood ratio tests comparing the power-law model with other plausible models. The normalized log likelihood ratio (NLLR) is used for non-nested 
functions while the raw log likelihood ratio (LLR) is used for the power-law with exponential cutoff model, p- values here are associated to the differences 
between the two models. Significant p-values are denoted in bold. 
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Table 3: Second column shows the p-value associated the differences between the data and the power-law model. Next four columns correspond to the 
log likelihood ratio tests comparing the power-law model with other plausible models. The normalized log likelihood ratio (NLLR) is used for non-nested 
functions while the raw log likelihood ratio (LLR) is used for the power-law with exponential cutoff model, p-values here are associated to the differences 
between the two models. Significant p-values are denoted in bold. 



